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Abstract 

We present a very fast algorithm for general matrix factorization of a data 
matrix for use in the statistical analysis of high-dimensional data via latent 
factors. Such data are prevalent across many application areas and generate 
an ever- increasing demand for methods of dimension reduction in order to 
undertake the statistical analysis of interest. Our algorithm uses a gradient- 
based approach which can be used with an arbitrary loss function provided 
the latter is different iable. The speed and effectiveness of our algorithm 
for dimension reduction is demonstrated in the context of supervised clas- 
sification of some real high-dimensional data sets from the bioinformatics 
literature. 

Keywords: matrix factorization, non-negative matrix factorization, 
high- dimensional data, microarray gene-expression data, supervised 
classification 

1. Introduction 

We let cci , . . . , x n denote n observed p-dimensional observations, where 
the number of variables p is very large relative to n. For example, in the 
analysis of microarray gene-expression data, n (the number of tissues) might 
be only 50, whereas p (the number of genes) might be in the tens of thousands. 
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We follow the traditional biologists' practice of letting 

X = [Xi, x n ) 

be the pxn data matrix. The usual statistical practice is to take the transpose 
of X, X T , as the data matrix. Without loss of generality, we assume that 
the overall mean of X is zero. 

In most statistical analyses of the data matrix X, some form of dimension 
reduction is required, typically before the primary analysis is performed, or 
with some approaches it might be done in conjunction with the main analysis. 
In recent times, much attention has been given to matrix factorizations of 
the form, 

X = AB, (1) 

where A is a p x q matrix and B is a q x n matrix and where q is chosen to 
be much smaller than p. For a specified value of q, the matrices A and B 
are chosen to minimize 

\\X-AB\\ 2 , (2) 

where || • || is the Frobenius norm (the sum of squared elements of the matrix). 
With this factorization, dimension reduction is effected by replacing the data 
matrix X by the solution B for the factor matrix B; the ith row of B 
gives the values of the zth metavariable for the n entities. Thus the original 
p variables are replaced by q metavariables. When the elements of X are 
nonnegative, we can restrict the elements of A and B to be nonnegative. This 
approach is called nonnegative matrix factorization (NMF) in the literature 
(Lee and Seung, 1999). We shall call the general approach where there are 
no constraints on A and B, GMF (general matrix factorization). 

The classic method for factoring the data matrix X is singular-value 
decomposition (SVD, Golub and van Loan (1983)). It follows from this 
theorem that we can decompose X exactly into the form 

X = LDR T , (3) 

where L — (Zi, . . . , Z&) is a p x k matrix with orthonormal columns, -R = 
(7*1, . . . , r*fc) is a n x k matrix with orthonormal columns, D is a diagonal 
matrix with elements d\ > d% > ■ ■ ■ > dk > 0, and k < min(p, n) is the rank 
of X. For any q < k, 

q 

^2 dihrf = arg min ||X-X|| 2 , (4) 

i=l XdM{q) 
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where M(q) is the set of rank-g p x n matrices; see, for example, Eckart and 
Young (1936). 

Let L {q) = (l u . . . , l g ), R {q) = (n, . . . , r q ), and D {q) be the diagonal 
matrix with diagonal elements d\, . . . , d q . Then on considering the matrix 
factorization ([I]) of X, it follows from (jl]) that for a specified value of q we 
can find the factor matrices A and B that minimize by taking A = L^ q ' 
and B = D {q) R {q)T . 

The calculation of the exact SVD of the matrix X has time complexity 
0(mm{pn 2 , n 2 p}). Hence the use of SVD for high- dimensional data sets is 
not feasible and the use of the best g-approximation (jlj) for q larger enough 
to capture most of the variance in X requires essentially the same amount 
of time as the full SVD. 

Hence we consider a very fast approach to the general matrix factorization 
(JTJ), using a gradient-based algorithm applicable for an arbitrary (differen- 
tiable) loss function. In the sequel, we consider the exponential family of loss 
functions that include the least-squares loss function (j2J) as a limiting case. 
The novelty of our algorithm lies in the way that on each global iteration it 

(a) iterates on only a small subset of the elements of the factor matrix A 
with the other factor matrix B fixed before reversing their roles; 

(b) loops through all the terms in the objective function, minimizing them 
individually at a time rather than their total sum (that is, it adopts a 
stochastic gradient descent approach). 

As to be presented in Section 3, our algorithm takes only between 10 
and 15 seconds in performing 300 global iterations to provide a q = 11 rank 
factorization of a 2000 x 62 data matrix for the colon cancer data set of 
Alon et al. (1999). In contrast, 20 global iterations with non-negative matrix 
factorization (NMF) for the same task required about 25 minutes. 

The effectiveness of our algorithm is to be demonstrated in its application 
to provide a reduction in the number of genes for use in the formation of 
classifiers in the supervised classification of five well-known high- dimensional 
data sets in the bioinformatics literature. 

2. Background 

Here we consider the factorization of X into AB in the spirit that it has 
no real importance in and of itself other than as a computationally convenient 
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means for obtaining a reduction in the number of variables. Of course in 
some situations in practice once the factorization has been made, attention 
will turn to the interpretability of the metavariables. 

The latter consideration has led to much recent interest in the use of 
NMF in the analysis of data for which the elements are nonnegative. It 
constrains the elements of the factor matrices A and B to be nonnegative, 
which can be advantageous from the point of view of interpretability. Lee and 
Seung, 1999; Lee and Seung 2001 developed NMF in order to improve upon 
the interpretability of the SVD. The nonnegativity constraints on A and B 
form a whole in a nonsubtractive way. In this way, NMF is considered as a 
procedure for learning a parts-based representation (Lee and Seung, 1999). 
However, as pointed out in Li et al. (2001) the additive parts by NMF are 
not necessarily localized. This led them to propose a subspace method, 
called local nonnegative matrix factorization (LNMF) for learning spatially 
localized, parts-based representation of visual patterns; see also (Donoho et 
al., 2004; Gao et al., 2005; Gogel et al., 2007) and the recent monograph 
(Cichocki et al., 2010). 

More recently, Ding et al. (2010) has considered variations of NMF where 
the elements of A, but not of B, are constrained to be nonegative, and so 
allowing the data matrix X to have mixed signs (semi-NMF). They also 
consider algorithms in which the basis vectors of A are constrained to be 
convex combinations of the data points. In other work, Witten et al. (2009) 
have proposed a penalized matrix decomposition for computing a g-rank 
approximation to X. 

3. Gradient-Based Algorithm for GMF 

We now describe our gradient-based algorithm for carrying out the general 
matrix factorization ([1]) of the data matrix X. The objective function to be 
minimized is given by 



where = Xij — X]/=i a i/^/j5 an d ^ is the loss function assumed to be 
differentiable with derivative denoted by ip. For illustrative purposes, we 
take \l/ to be a member of the exponential family of loss functions given by 




p 




\l/(x; a) = 2 



(cosh(ax) — 1) 



a 2 (exp(ax) + exp(— ax) — 2) 



(6) 
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where a is a regularization parameter. Note that a squared loss function may 
be regarded as a marginal limit in relation to this family of loss functions 
since 



In our initial experiments Nikulin and McLachlan (2009), we tried a range 
of values between 0.003 and 0.004 for a, which gave similar results as for the 
squared loss function. 

The algorithm can be implemented as follows. 

Gradient-based framework for matrix factorization 

1: Input: X - matrix of microarrays. 

2: Select m - number of global iterations; q - number of factors; A > - 
initial learning rate, < £ < 1 - correction rate, L$ - initial value of 
the target function. 

3: Initial matrices A and B may be generated randomly. 

4: Global cycle: repeat m times the following steps 5 - 17: 

5: genes-cycle: for i — 1 to p repeat steps 6 - 15: 

6: tissues-cycle: for j — 1 to n repeat steps 7 - 15: 

7: compute prediction S = X)/=i a ifbfj] 

8: compute error of prediction: E = Xij — S; 

9: internal factors-cycle: for / = 1 to q repeat steps 10 - 15: 

10: compute a = ctifbfj] 

11: update <= + \ip(E)bfj; 

12: E <= E + a - aj/ &/_,-; 

13: compute a = a^bfj] 

14: update bfj <= bfj + Xip(E)aif; 

15: E <= E + a — aifbfj] 

16: compute L = L(A,B); 

17: L,s = L if L < L,s; otherwise: A •<= A • £. 

18: Output: A and B - matrices of loadings and metagenes. 

The following partial derivatives are necessary for the above algorithm 
(see steps 11 and 14 above): 



lim ^(x; a) = x 2 . 



(7) 



da if 



i>(E i:j )a if . 



(9) 



(8) 
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The target function ([5J) that needs to be minimized includes a total of 
q(p + n) regularization parameters. The algorithm loops through all the 
differences Eij, minimizing them as a function of the elements of the two 
factor matrices A and B. If the optimization were to be performed by fixing 
on B and solving the optimization with respect to A and then reversing the 
roles of the variables with the intention to iterate until convergence, there 
can be difficulties with convergence given that the two factor matrices are 
completely unconstrained. We circumvent this problem by iterating on only 
some of the elements of A before iterating on some of the elements of B. 
This partial updating of A before a switch to a partial updating of B is very 
effective and is responsible for the very fast convergence of the process. 

This procedure of role reversal between the elements of the two factors 
matrices after only partial updating of their elements has been used effectively 
in the context of a recommender system; see, for example Paterek (2007) and, 
more recently, Koren (2009), who used factorization techniques to predict 
users' preferences for movies on the Netflix Prize data set. 

It is noted that in the case of the squared loss function, we can optimise 
the value of the step-size. However, taking into account the complexity of 
the model, we recommend maintaining fixed and small values of the step 
size or learning rate. In all our experiments we applied our algorithm using 
100 global iterations with the following regulation parameters. The initial 
learning rate A was set at 0.01, while the correction rate £ rate was set at 
0.75. The convergence of the algorithm is illustrated in Figure [T] for GMF 
applied to the 2000 x 62 data matrix X for three data sets, including the colon 
cancer data set of Alon et al. (1999) with n = 62 tissues and p = 2, 000 genes. 
The other cancer data sets (leukaemia and lymphoma) are to be described in 
the next section. As pointed out in the introductory section, our algorithm 
takes only between 10 and 15 seconds in performing 300 global iterations 
to provide a q = 11 rank factorization of this data matrix compared to 
around 25 minutes to perform 20 global iterations with non-negative matrix 
factorization. We used a Linux computer with speed 3.2GHz, RAM 16GB 
with the algorithm using special code written in C). 

4. Application of GMF in Supervised Classification 

In the sequel, we focus on the performance of GMF in its application 
to some data sets in the context of supervised classification (discriminant 
analysis). In this latter context, we have an obvious criterion to guide in the 
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choice of the number q of metavariables, namely the estimated error rate of 
the classifier. 

Concerning suitable estimates for the error rate of a classifier, we in- 
troduce the following notation. It is assumed that the observed data points 
Xi, . . . , x n come from g possible classes, C\, . . . , C g , with known class labels 
specified by z, where 

Z (zi, . . . , , 

and where Zj is a ^-dimensional vector of zeros or ones with its zth element, 
Zij, defined to be one if Xj comes from class Ci, and zero otherwise (i = 
1, . . . , g; j = 1, . . . , n). For the allocation of an observation x Q to one of the 
g possible classes, we let r(x ; X, z) be a classifier formed from the training 
data X with its known class labels in z, where r(x ; X , z) equal to i implies 
that x Q is assigned to class C{(i — 1, . . . , g). We shall henceforth abbreviate 
r(x ; X , z) to r{x Q \ X) Also, we let e(X) denote an estimate of the error 
rate of r(x ; X, z), where dependency of this estimate on z is also suppressed 
for brevity of expression. If we use, for example, n-fold cross-validation (that 
is, the leave-one-out estimate), then 

a n 
i=i j=i 

where the function H[u,v] is defined to be equal to 1 if u ^ v, and zero 

otherwise and where X^ denotes X with Xj deleted. Finally, we let B^ q \x) 
denote the solution for B when GMF is applied to X for a specified value 
of q. 

In the case where the full data matrix X is replaced by the reduced matrix 

B 9 (X) computed for a specified q, we can use (fTUI) to estimate the expected 
error rate of the classifier formed from this reduced set. An estimate is given 
by 

e 1 (B iq \x)) = n-iJ2iL z v H ^ r ( b f>Eu ) (Xm (H) 

i=i j=i 

where bj is the jth column of B^\x) and B^(X) denotes B 9 \x) with 
its jth column bj deleted. 

As pointed out by Ambroise and McLachlan (2002), this estimate will 
provide an optimistic assessment of the true error rate of the classifier, since 

the reduced data matrix B^ q \x) should be recomputed on each fold of the 
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cross-validation; that is, in the right-hand side of ( TIT]) . B^(X) should be 

replaced by B 9 (Xy)), the reduced data matrix obtained by applying the 
GMF algortihm to Xw, the data matrix X with its jth column deleted. 
This estimate can be written as 

g n 

e 2 (B {q \x))=n-^J2 z ^ r ^B {Q \x U) ))], (12) 
i=i j=i 

In order to calculate this estimated error rate with n-fold cross-validation, 
it means that the GMF algorithm has to be run n times in addition to its 
replication to the full data set. This is feasible given the speed with which 
the algorithm carries out the GMF. It should be pointed out that since the 
GMF does not make use of the known class labels, the selection bias of the 

classifier based on the selected subset of metavariables B^ will not be nearly 
as great in magnitude as with selection methods that use the class labels. 
Also, in practice, n-fold cross validation can produce an estimate with too 
much variability and so five- or ten-fold cross validation is often used in a 
variance versus bias tradeoff (Ambroise and McLachlan, 2002). 

We can choose the final value of q by taking it to be the value q Q that 

minimizes the estimated error rate e2(B^ q \x)); 

q Q = argmine 2 ( J B (9) (X( i) ), (13) 

q&Q 

where Q denotes the set of values considered for q. However, there is still a 
selection bias if we use 

g n 

e 2 (B iqo \x))=n-^Y, z ^ r ^B {q °\x u) ))], (14) 
i=i j=i 

to estimate the error rate of the classifier based on the reduced set with the 
smallest error rate over the values of q considered; see, for example, (Wood 
et al., 2007; Zhu et a!., 2008). We can correct for this bias by using the 
estimate 

9 n 

e 3 (B (qo) ) = n-^^z^^rtb^^fXy))], (15) 



=1 j=l 



where 



q oj = argmm^^ — , (16) 



qeQ z — ' * — ' n 
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and Xujr) denotes the data matrix X with Xj and Xy deleted. 

It can be seen from ( I16p that in order to calculate the cross-validated 
estimate f lT5|) . we need to perform the GMF n(n — 1) times in addition to the 
original application to the full data set X. This is still feasible since GMF can 
be implemented so quickly, although the total computational time becomes 
large as n increases. As noted above, using, say, ten-fold cross-validation 
would reduce the number of times that GMF has to be employed. In the 
data sets considered here, the increase in the estimated error rate given by 

the use of e^B^) over (j!4p was very small (not of practical significance). 

5. Supervised Classification of Some Cancer Data Sets 

We shall demonstrate the application of the GMF for dimension reduction 
in the context of supervised classification of five cancer data sets that have 
been commonly analysed in the bioinformatics literature, as briefly described 
in the following section. 

5.1. Five Data Sets 

For the colon data set (Alon et al., 1999) the data matrix X contains the 
expression levels of p=2000 genes in each of n=62 tissue samples consisting 
of rii =40 tumours and ?i2=22 normals. 

The data matrix for the leukaemia data set (Golub et al., 1999) contains 
the expression levels of p = 7129 genes for each of n = 72 patients, consisting 
of rii=47 patients suffering from acute lymphoblastic leukaemia (ALL) and 
n 2 = 25 patients suffering from acute myeloid leukaemia (AML). 

We followed the pre-processing steps of (Golub et al., 1999) applied to the 
leukaemia set: 1) thresholding: floor of 1 and ceiling of 20000; 2) filtering: 
exclusion of genes with max/min < 2 and (max - min) < 100, where max 
and min refer respectively to the maximum and minimum expression levels 
of a particular gene across the tissue samples. This left us with p = 1896 
genes. In addition, the natural logarithm of the expression levels was taken. 

The data matrix for the lymphoma data set (Alizadeh et al., 2000) con- 
tains the gene expression levels of the three most prevalent adult lymphoid 
malignancies: n\= 42 samples of diffuse large B-cell lymphoma (DLCL), n 2 = 
9 samples of follicular lymphoma (FL), and n 3 = 11 samples of chronic lym- 
phocytic leukaemia (CLL). The total sample size is thus n= 62 and there are 
p = 4026 genes. 
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The Sharma data set was described in (Sharma et al., 2005) and contains 
the expression levels (mRNA) of p= 1368 genes for each of 60 blood samples 
taken from 56 women. Each sample was labelled by clinicians, with ni=24 
labelled as having breast cancer and n 2 = 36 labelled as not having it. Some 
of the samples were analysed more than once in separate batches giving a 
total of n= 102 labelled samples. 

The fifth data set (Khan et al., 2001) contains the expression levels of 
p= 2308 genes for each of n = 83 tissue samples, each from a child who was 
determined by clinicians to have a type of small round blue cell tumour. This 
includes the following g=4 classes: neuroblastoma (N), rhabdomyosarcoma 
(R), Burkitt lymphoma (B) and the Ewing sarcoma (E). The numbers in 
each class are: N(ni=18), R(?t, 2 =25), B(n 3 = 11), and E(n 4 =29). 

We applied double normalization to each data set. Firstly, we normalized 
each column to have means zero and unit standard deviations. Then we 
applied the same normalization to each row. 

5.2. Error rates for classifiers formed on the basis of metagenes 

In Figure [21 we plot the cross- validated error rate e\ versus the number of 
metagenes q for four of the five data sets, using the support vector machine 
(SVM) in the case of g = 2 classes and (multinomial) logistic regression (LR) 
in the case of g > 2 classes. We did not plot e x for the leukaemia data set as 
it was close to zero if q > 10. 

In Tabled], we list the cross- validated error rates e\ and its bias-corrected 
version e 2 for each of the four data sets, where the classifier (SVM or MLR) 
is formed on the basis of q metagenes. To give some guide as to the level 
of performance of the performance of these classifiers, we also list the value 
of the error rate using the nearest- shrunken centroids method (Tibshirani et 
al., 2002). The bias-corrected error rate e 2 is smaller than that of the NSC 
method for all but one of the data sets (the lymphoma set). The estimated 
error rate for the nearest-shrunken method corresponds to e 2 in that it can 
be regarded as an almost unbiased estimate for a given subset of the genes, 
but it has not been corrected for bias over the set of values q considered; see 
Wood et al. (2007) and Zhu et al. (2008). 

On the question of which metavariables (metagenes) are useful in the dis- 
criminatory process, an inspection of the heat maps (coloured lots of each 
metagene value for each tissue) can be useful, but not always. To illustrate 
this, we have plotted the heat maps in Figure [3] for three of the data sets. In 
this figure, we have sorted the tissues into their classes in order to consider 
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visual differences between the patterns. In the case of the colon data in Fig- 
ure El^a), we cannot see clear separation of the negative and positive classes. 
In contrast, in the case of the leukaemia data in Figure [S^b), metagene N2 
separates the first 47 tissues (from the top) from the remaining 25 tissues 
with only one exception. It is tissue 58, which is the only one misclassified 
tissue in Table [T] (cases q = 3,4). Similarly, in the case of the lymphoma 
data in Figure EIc), metagene Nl separates clearly CLL from the remaining 
two classes. Further, metagene N3 separates DLCL from the remaining two 
classes. 

To assist with the interpretation of the metagenes, we can examine the 
Gene Ontology (GO) (The GO Consortium, 2009) and the pathway records 
of the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et 
al., 2010) for those genes that have high (absolute) correlations with the 
metagenes. 

6. Conclusions 

We have presented an algorithm for performing extremely fast general 
matrix factorization (GMF) of a high-dimensional data matrix X. In prac- 
tice some form of dimension reduction is invariably needed if standard or 
even nonstandard methods of statistical analysis are to be employed to gain 
meaningful insight from high-dimensional data matrices. The algorithm un- 
dertakes the factorization using gradient-based optimization for an arbitrary 
(differentiable) loss function. The p x n data matrix X is approximated by 
the product of two matrices, AB, where the q x n factor matrix B can be 
used in place of X for a specified value of q taken to be much smaller than 
p. The n columns of B contain the values of the q metavariables for each of 
the n observed data vectors. The stability of the algorithm depends essen- 
tially on a properly selected learning rate, which must not be too big. We 
can provide additional functions so that the learning rate will be reduced or 
increased depending on the current performance. 

To demonstrate the usefulness of the reduced data matrix B in the con- 
text of supervised classification, we applied it to the data matrices from five 
cancer data matrices of microarray gene expressions that have been com- 
monly analysed in the medical and scientific literature. The classification 
of the microarrays (tissue samples) containing these gene expressions are of 
known classification with respect to g classes, where g varies from 2 to 4. 
The results suggest that GMF as implemented by our algorithm is effective 
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in providing a reduced data matrix for their subsequent use in forming a 
classifier that was taken to be either the SVM in the case of g = 2 classes or 
logistic regression for g > 2 classes. 

Some main issues associated with the use of GMF are the choice of 
the number q of metavariables (latent factors), the interpretability of the 
metavariables, and the need for prefiltering of the variables before the fac- 
torization. These issues are beyond the scope of this paper. 

But briefly on these issues here, the choice of the number of metavariables 
q in supervised classification can be based on the estimated error rate of the 
classifier formed from the specified number of metavariables. The situation 
is not as straightforward in the context of cluster analysis where there are 
no training data of known origin even if there were some a priori knowledge 
about the existence of some group structure in the data. One way to proceed 
in this context is to use the stability in the clustering as the level of q is varied 
as a guide to its final choice (Brunet et al., 2004, Tamayo et al., 2007). The 
question of whether there is a need for prefiltering in the context of cluster 
analysis has been considered recently by Zheng et al. (2009). 

The problem of interpretability of the metavariables is generally not as 
straightforward as with NMF's since the latter are non-subtractive combina- 
tions of the original variables (Zheng et al., 2009). In general, we can cal- 
culate the correlations between the original variables and the metavariables. 
For microarray data, we are currently designing a program that automatically 
attempts to establish links between genes highly correlated with a metagene 
and the Gene Ontology (The GO Consortium, 2009) and the the pathway 
records of the Kyoto Encyclopedia of Genes and Genomes (Kanehisa et al., 
2010). 

References 

Lee, D.D., Seung, H.S., 1999. Learning the parts of objects by non-negative 
matrix factorization. Nature 401, 788-791. 

Golub, G., van Loan, C, 1983. Matrix Computations. Johns Hopkins Uni- 
versity Press, Baltimore. 

Eckart, C, Young, G., 1936. The approximation of one matrix by another 
of low rank. Psychometrika 1, 211-218. 



12 



Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., 
Levine, A. J., 1999. Broad patterns of gene expression revealed by clustering 
analysis of tumor and normal colon tissues probed by oligonucleotide arrays. 
Proceedings of the National Academy of Sciences USA 96, 6745-6750. 

Lee, D.D., Seung, H.S., 2001. Algorithms for non-negative matrix factoriza- 
tion. Advances in Neural Information Processing System 13, MIT Press. 

Li, S.Z., Hou, X.W., Zhang, H.J., Cheng, Q.S., 2001. Learning spatially 
localized parts-based representation. In Proceedings of IEEE International 
Conference on Computer Vision and Pattern Recognition Volume I, Hawaii, 
pp.207-212. 

Donoho, D., Stodden, V., 2004. When does non-negative matrix factorization 
give a correct decomposition into parts ? In Advances in Neural Information 
Processing Systems 16, Cambridge, MA: MIT Press. 

Gao, Y., Church, G., 2005. Improving molecular cancer class discovery 
through sparse non- negative matrix factorization. Bioinformatics 21, 3970- 
3975. 

Fogel, P., Young, S.S., Hawkins, D.M., Ledirac, N., 2007. Inferential, robust 
non-negative matrix factorization analysis of microarray data. Bioinformat- 
ics 23, 44-49. 

Cichocki, A., Zdunek, R., Phan, A.H., Amari, S-L, 2010. Nonnegative Ma- 
trix and Tensor Factorizations. Wiley, Chichester 

Ding, C, Li, T., Jordan, M.I., 2010. Convex and semi-nonnegative matrix 
factorizations. IEEE Transactions on Pattern Analysis and Machine Intelli- 
gence 32, 45-55. 

Witten, D.M., Tibshirani, R., Hastie, R., 2009. A penalized matrix de- 
composition, with applications to sparse principal components and canonical 
correlation analysis. Biostatistics 10, 515-534. 

Nikulin, V., McLachlan, G.J., 2009. On a general method for matrix factor- 
ization applied to supervised classification. In Proceedings of 2009 IEEE In- 



13 



ternational Conference on Bioinformatics and Biomedicine Workshop, Wash- 
ington, D.C. , J Chen et al. (Eds.). Los Alamitos, California: IEEE, pp. 

43-48. 

Paterek, A., 2007. Improving regularized singular value decomposition for 
collaborative filtering. KDD Cup , San Jose, CA: ACM, pp. 39-42. 

Koren, Y., 2009. Collaborative filtering with temporal dynamics. In Pro- 
ceedings of the 15th ACM SIGKDD Conference on Knowledge Discovery 
and Data Mining , Paris, pp. 447-455. 

Ambroise, C, McLachlan G.J., 2002. Selection bias in gene expression on 
the basis of microarray gene expression data. Proceedings of the National 
Academy of Sciences USA 99, 6562-6566. 

Wood, I., Visscher, P., Mengersen, K., 2007. Classification based upon ex- 
pression data: bias and precision of error rates. Bioinformatics 23, 1363- 
1370. 

Zhu, J.X., McLachlan, G.J., Ben-Tovim, L., Wood, I., 2008. On selection 
biases with prediction rules formed from gene expression data. Journal of 
Statistical Planning and Inference 38, 374-386. 

Golub, T.R., Slonim D.K., Tamayo, P., Huard, C, Gassenbeck, M., Mesirov, 
J.P, Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., et al. 1999. 
Molecular classification of cancer: class discovery. Science 286, 531-537. 

Dudoit, S., Fridlyand, J., Speed, T.P., 2002. Comparison of discrimination 
methods for the classification of tumors using gene expression data. Journal 
of the American Statistical Association 97, 77-87. 

Alizadeh, A., Eisen, M.B., Davis, R.E., Ma, C, Lossos, I.S., Rosenwal. A., 
Boldrick, J.C., Sabet, H., Tran, T., Yu, X., et al. 2000. Distinct types of 
diffuse large B-cell lymphoma identified by gene expression profiling. Nature 
403, 503-511. 

Sharma, P., Tibshirani, R., Skaane, P., Urdal, P., Berghagen, H., 2005. Early 
detection of breast cancer based on gene-expression patterns in peripheral 



14 



blood cells. Breast Cancer Research 7, R634-R644. 

Khan, J., Wei, J., Ringner, M., Saal, L., Ladanyi, M., Westermann, F., 
Berthold, F., Schwab, M., 2001. Classification and diagnostic prediction of 
cancers using gene expression profiling and artificial neural networks. Nature 
Medicine 7, 673-679. 

Tibshirani, R., Hastie, T., Narasimhan, B., Chu, C, 2002. Diagnosis of mul- 
tiple cancer types by shrunken centroids of gene expression. Proceedings of 
the National Academy of Sciences USA 99, 6567-6572. 

The GO Consortium. 2009. Gene ontolog [Online]. Available: 
http : //www . geneontology . org 



Kanehisa, M., Goto, S., Furumichi, M., Tanabe, M., Hirakawa, M., 2010. 
KEGG for representation and analysis of molecular networks involving dis- 
eases and drugs. Nucleic Acids Research 38, D355-D360. 

Brunet, J-P., Tamayo, P., Golub, T.R., Mesirov, J. P., 2004. Metagene pro- 
jection for cross-platform, cross-species characterization of global transcrip- 
tional states. Proceedings of the National Academy of Sciences USA 101, 
4164-4169. 

Tamayo, P., Brunet, J-P., Scanfeld, D., Ebert, B.L., Gillette, M.A., Roberts, 
C.W.M., Mesirov, J. P., 2007. Proceedings of the National Academy of Sci- 
ences USA 104, 5959-5964. 

Zheng. C, Huang, D., Zhang, L., Kong, X., 2009. Tumor clustering using 
nonnegative matrix factorizations with gene selection. IEEE Transactions on 
Information Technology in Biomedicine 13, 599-607. 



15 



Table 1: Some selected experimental results, where numbers in brackets in the first column 
indicate numbers of classes in the corresponding data set, and numbers of misclassificd 
entries in the fourth, fifth and sixth columns. Results in the sixth column "NSC" were 
obtained using nearest-shrunken centroids method with threshold parameter A as it was 
described in Tibshirani et al., (2002). The column p s indicates the number of used/selected 
features. 



Data 


Model 


q 


ei 


e 2 


NSC 


Ps 


A 


Colon (2) 


SVM 


8 


0.0806 (5) 


0.1129 (7) 


0.129 (8) 


141 


1.3 


Leukaemia (2) 


SVM 


25 


0(0) 


0.0139 (1) 


0.0139 (1) 


73 


1.9 


Lymphoma (3) 


MLR 


10 


0.0322 (2) 


0.0322 (2) 


0.0161 (1) 


3336 


0.8 


Breast (2) 


SVM 


18 


0.1 (6) 


0.15 (9) 


0.2 (12) 


53 


1.2 


Blue cell tumour (4) 


MLR 


21 


0.0241 (2) 


0.0482 (4) 


0.0602 (5) 


464 


1.8 
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global Iteration 



Figure 1: Behaviour of the target ([5]) with squared loss as a function of global iteration 
for for q = 10 metagenes; dashed blue, solid black and dot-dashed red lines correspond to 
the colon, leukaemia and lymphoma cases. 
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Figure 2: The estimated errror rate ei as a function of the number q of metagenes. 
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(a):metagene (colon) (b):metagene (leukaemia) (c):metagene (lymphoma) 



Figure 3: Images of the matrix B for q — 5: (a) colon (sorted from the top: 40 positive 
then 22 negative), (b) leukaemia (sorted from the top: 47 ALL, then 25 AML) and (c) 
lymphoma (sorted from the top: 42 DLCL, then 9 FL, last 11 CLL). All three matrices 
were produced using the GMF algorithm with 100 global iterations as described in the 
text. 
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